Fusion/fission protein family identification in Archaea

ABSTRACT The majority of newly discovered archaeal lineages remain without a cultivated representative, but scarce experimental data from the cultivated organisms show that they harbor distinct functional repertoires. To unveil the ecological as well as evolutionary impact of Archaea from metagenomics, new computational methods need to be developed, followed by in-depth analysis. Among them is the genome-wide protein fusion screening performed here. Natural fusions and fissions of genes not only contribute to microbial evolution but also complicate the correct identification and functional annotation of sequences. The products of these processes can be defined as fusion (or composite) proteins, which consist of two or more domains originally encoded by different genes, and split proteins, which originate from the separation of a gene in two (fission). Fusion identifications are required for proper phylogenetic reconstructions and metabolic pathway completeness assessments, while mappings between fused and unfused proteins can fill some of the existing gaps in metabolic models. In the archaeal genome-wide screening, more than 1,900 fusion/fission protein clusters were identified, belonging to both newly sequenced and well-studied lineages. These protein families are mainly associated with different types of metabolism and with genetic and cellular processes. Moreover, 162 of the identified fusion/fission protein families are archaeal specific, having no identified fused homolog within the bacterial domain. Our approach was validated by the identification of experimentally characterized fusion/fission cases. However, around 25% of the identified fusion/fission families lack functional annotations for both composite and split states, showing the need for experimental characterization in Archaea.
IMPORTANCE Genome-wide fusion screening has never been performed in Archaea on a broad taxonomic scale. The overlay of multiple computational techniques allows the detection of a fine-grained set of predicted fusion/fission families, instead of rough estimations based on conserved domain annotations only. The exhaustive mapping of fused proteins to bacterial organisms allows us to capture fusion/fission families that are specific to archaeal biology, as well as to identify links between bacterial and archaeal lineages based on co-occurrence of taxonomically restricted proteins and their sequence features. Furthermore, the identification of poorly characterized lineage-specific fusion proteins opens up possibilities for future experimental and computational investigations. This approach enhances our understanding of Archaea in general and provides potential candidates for in-depth studies in the future.


Genetic information processing
Within the genetic information processing category, the functional annotations of the fusion/fission protein families cover a range of DNA, RNA, and protein processing pathways, with more than 30 protein clusters involved in DNA repair and replication alone. Although fission events tend to prevail in these clusters, several families demonstrate more complicated patterns. This is the case for the DNA polymerase B and DNA helicase Mcm protein families (classified as fissions), which contain inteins. Inteins are part of the gene and are auto-spliced out as the protein is expressed (1).
Insertion of inteins can happen multiple times and eventually lead to fission events between or within domains, which produce misleading functional annotations. In addition, two fusion events containing the uracil-DNA glycosylase domain were identified in unclassified Euryarchaeota and Thermoplasmatales. Furthermore, new insights into the evolution of the eukaryotic-like DNA repair endonuclease XPF can be gained from the identification of the corresponding split protein pairs within the DPANN superphylum, which map to composite proteins in Euryarchaeota and suggest the loss of the protein in Crenarchaeota.
Within transcription processes, fewer than ten fusion/fission events were identified. The two largest protein clusters from this category correspond to the transcription initiation factor TFB and the RNA polymerase subunit B. The first has sixteen split pairs and, according to its taxonomic distribution and the domain architecture of the distantly homologous eukaryotic form (2), might represent a fission event. In the cluster containing RNA polymerase B proteins, on the contrary, several composite proteins corresponding to a fusion of the two subunits (rpoB1 and rpoB2) are found, and this cluster was assigned as a fusion. This composite form is mostly present in the TACK and Asgard supergroups, while the split forms are widely distributed in Euryarchaeota.
The identified clusters associated with translation correspond to ribosomal proteins, ribosome biogenesis proteins, and aminoacyl-tRNA biogenesis proteins. Clusters from the first two categories are predominantly classified as fission. Nevertheless, we observed two singleton fusion events with high support from the conservation of the respective syntenic split pairs. The first corresponds to a fusion between the large ribosomal subunit protein L5 and the small ribosomal protein S4e, identified in many TACK and Euryarchaeota lineages as well as in Ca. Aenigmarchaeota (DPANN). The second is a fusion between the small ribosomal protein S3 and the ribonuclease P protein subunit POP4, observed in a single Ca. Bathyarchaeota assembly. As for tRNA biogenesis, clusters containing synthetases or methyltransferases involved in amino acid metabolism represent taxonomically restricted or assembly-restricted fission/fusion events and are identified only in taxonomically unclassified lineages. The fusion partners of these enzymes are either cytochrome-containing or sulfur carrier proteins, perhaps reflecting the recruitment of the module to perform other functions associated with group transfers.
In the archaeosine synthase alpha-subunit protein family, we observe that the composite protein is present in Euryarchaeota and Asgard archaea, while in Crenarchaeota representatives the split version of the protein is present. Within the protein folding and degradation category, a few clusters corresponding to fission events were found. However, we identified a singleton fusion event between a proteasome subunit and an uncharacterized universal archaeal protein in a Nitrosopumilus assembly. This fusion event has high support from the split protein side and hints at the possible involvement of the composite protein in folding processes.

Classification of complex protein families
The classification of a fusion/fission event can be complicated by the existence of homologous proteins in the same cluster. For example, in the probable fusion cluster containing ABC transporters (cluster 19), several homologs are present. In this case, the fusion of two ATP-binding domains corresponds to three functionally distinct proteins: the energy-coupling factor transporter (EcfA1 and EcfA2) (3), the general nucleoside transporter (4), and the methyl-coenzyme M reductase system (component A2) (5).
A final classification problem is the recurring lack of agreement between the bacterial and archaeal domains on the fission/fusion state of a protein family. One such case concerns a fusion/fission event of hydroxymethylpyrimidine kinase/phosphomethylpyrimidine kinase (ThiD) and thiamine-phosphate diphosphorylase (ThiN), which are involved in the metabolism of thiamine, a vitamin ubiquitous in prokaryotes (6). In Bacteria, phosphorylation of the pyrimidine and condensation with thiazole are catalyzed by two separate enzymes, ThiD and ThiE (analogous to ThiN) (7, 8), with a small fraction of organisms containing a sequence resulting from a fusion of the two. In Archaea, on the other hand, the majority of organisms have the composite version of the enzymes (ThiDN), with only 19 syntenic pairs of split proteins found across different phyla, which rather suggests one or more fission events.

number of fission events per genome (split protein count); b. the average number of "fusion and fission" events per genome (composite protein count);

Fig S6. Presence and distribution of lineage-specific fusion/fission families in the composite state (class or higher). The bar chart to the left indicates the total counts of the fusion/fission families for the composite proteins. The box plot to the right shows the distribution of the fusion/fission families within the lineage representatives (assemblies).

Fig S1. Relation between split proteins (or split sets) and total number of proteins per assembly. a) Correlation between the number of high-confidence fission split sets and number of proteins per assembly. b) Correlation between the number of high-confidence fission split proteins and number of proteins per assembly. c) Correlation between total number of split sets and number of proteins per assembly. d) Correlation between the total number of split proteins and number of proteins per assembly.
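
The kind of relation shown in Fig S1 can be quantified with a correlation coefficient. The following is a minimal sketch, assuming a Pearson coefficient (the caption does not state which statistic was used); the function name `pearson_r` and the counts are invented for illustration and are not data from this study.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented toy values (assumption): one entry per assembly.
proteins_per_assembly = [1500, 2000, 2500, 3000]
split_sets_per_assembly = [10, 14, 18, 22]

r = pearson_r(proteins_per_assembly, split_sets_per_assembly)
print(f"r = {r:.3f}")
```

A rank-based coefficient such as Spearman's could be substituted if the counts are strongly skewed; which one is appropriate depends on the data.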

Fig S4. Gain in number of fusion/fission families with the addition of each assembly per lineage (order or higher). a. Gain in number of fusion/fission families for the syntenic splits in well-represented lineages. b. Gain in number of fusion/fission families for the composite state in well-represented lineages. c. Gain in number of fusion/fission families for the syntenic splits in poorly represented lineages. d. Gain in number of fusion/fission families for the composite state in poorly represented lineages.
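
A gain curve of this kind can be illustrated with a short sketch. It assumes that each assembly is represented as a set of fusion/fission family identifiers and that assemblies are added in a fixed order; the function name `family_gain_curve`, the toy identifiers, and the ordering are hypothetical and do not reproduce the pipeline used in this study.

```python
from typing import Iterable, List, Set


def family_gain_curve(assemblies: Iterable[Set[str]]) -> List[int]:
    """Return the cumulative number of distinct families after each assembly."""
    seen: Set[str] = set()
    curve: List[int] = []
    for families in assemblies:
        seen |= families          # add families not observed before
        curve.append(len(seen))   # cumulative count after this assembly
    return curve


# Toy lineage with three assemblies (hypothetical family identifiers).
toy_lineage = [{"fam1", "fam2"}, {"fam2", "fam3"}, {"fam1", "fam3"}]
print(family_gain_curve(toy_lineage))  # prints [2, 3, 3]
```

Under these assumptions, a curve that flattens would suggest that further assemblies from the lineage add few new families, whereas a curve that keeps rising would suggest that the lineage is still undersampled. Because the result depends on the order in which assemblies are added, an analysis could average the curve over several random orderings.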